DIGITAL SEARCH TREES AND CHAOS GAME REPRESENTATION

Peggy Cénac 1, Brigitte Chauvin 2, Stéphane Ginouillac 2 and Nicolas Pouyanne 2

Abstract. In this paper, we consider a possible representation of a DNA sequence in a quaternary
tree, in which one can visualize repetitions of subwords (seen as suffixes of subsequences). The
CGR-tree turns a sequence of letters into a Digital Search Tree (DST), obtained from the suffixes of
the reversed sequence. Several results are known concerning the height and the insertion depth of
DSTs built from independent successive random sequences having the same distribution. Here the
successively inserted words are strongly dependent. We give the asymptotic behaviour of the
insertion depth and of the length of branches for the CGR-tree obtained from the suffixes of a
reversed i.i.d. or Markovian sequence. At first order, this behaviour turns out to be the same as in
the case of independent words. As a by-product, asymptotic results on the length of the longest runs
in a Markovian sequence are obtained.

Résumé. The representation defined here is a possible representation of a DNA sequence in a
quaternary tree whose construction makes it possible to visualize repetitions of suffixes. Starting
from a sequence of letters, a digital search tree is built on the set of suffixes of the reversed
sequence. Results on the height and the insertion depth have been established when the sequences to
be placed in the tree are independent of one another. Here the words to be inserted are strongly
dependent. We give the asymptotic behaviour of the insertion depth and of the length of the branches
for a tree obtained from the suffixes of a reversed i.i.d. or Markovian sequence. At first order,
this asymptotic behaviour is the same as in the case where the inserted words are independent.
Moreover, some results can also be interpreted as convergence results on the lengths of the longest
runs of a letter in a Markovian sequence.
1. Introduction

In recent years, DNA has been represented by means of several methods in order to make pattern
visualization easier and to detect local or global similarities (see for instance Roy et al.). The
Chaos Game Representation (CGR) provides both a graphical representation and a storage tool. From
a sequence in a finite alphabet, CGR defines a trajectory in a bounded subset of $\mathbb{R}^d$ that
keeps all statistical properties of the sequence. Jeffrey was the first to apply this iterative
method to DNA sequences. Cénac et al. study the CGR with an extension of word-counting based
methods of analysis. In this context, sequences are made of 4 nucleotides named A (adenine), C
(cytosine), G (guanine) and T (thymine).

Keywords and phrases: Random tree, Digital Search Tree, CGR, lengths of the paths, height, insertion
depth, asymptotic growth, strong convergence

1 INRIA Rocquencourt and Université Paul Sabatier (Toulouse III) - INRIA Domaine de Voluceau B.P.105
78 153 Le Chesnay Cedex (France)

2 LAMA, UMR CNRS 8100, Bâtiment Fermat, Université de Versailles - Saint-Quentin F-78035 Versailles

© EDP Sciences, SMAI 1999

TITLE WILL BE SET BY THE PUBLISHER

The CGR of a sequence $U_1 \dots U_n \dots$ of letters $U_n$ from a finite alphabet $\mathcal{A}$ is
the sequence $(X_n)_{n \ge 0}$ of points in an appropriate compact subset $S$ of $\mathbb{R}^d$
defined by

$$X_0 \in S, \qquad X_{n+1} = \theta\,(X_n + \ell_{U_{n+1}}),$$

where $\theta$ is a real parameter ($0 < \theta < 1$), each letter $u \in \mathcal{A}$ being assigned
to a given point $\ell_u \in S$. In the particular case of Jeffrey's representation,
$\mathcal{A} = \{A, C, G, T\}$ is the set of nucleotides and $S = [0,1]^2$ is the unit square. Each
letter is placed at a vertex as follows:

$$\ell_A = (0,0), \quad \ell_C = (0,1), \quad \ell_G = (1,1), \quad \ell_T = (1,0),$$

$\theta = 1/2$ and the first point $X_0$ is the center of the square. Then, iteratively, the point
$X_{n+1}$ is the middle of the segment between $X_n$ and the square's vertex $\ell_{U_{n+1}}$:

$$X_{n+1} = \frac{X_n + \ell_{U_{n+1}}}{2},$$

or, equivalently,

$$X_n = \sum_{k=1}^{n} \frac{\ell_{U_k}}{2^{n-k+1}} + \frac{X_0}{2^n}.$$

Figure 1 represents the construction of the word ATGCGAGTGT.
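As an illustration of the midpoint recursion and of its closed form, the following Python sketch computes the CGR points of a DNA word in Jeffrey's setting ($\theta = 1/2$, $X_0$ at the center of the unit square). The names `cgr_points` and `cgr_closed_form` are ours and purely illustrative; exact rational arithmetic is assumed in order to avoid rounding effects.

```python
from fractions import Fraction

# Vertices of the unit square assigned to the nucleotides (Jeffrey's representation).
VERTEX = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
HALF = Fraction(1, 2)


def cgr_points(seq):
    """Return [X_0, X_1, ..., X_n] using X_{k+1} = (X_k + l_{U_{k+1}}) / 2."""
    x, y = HALF, HALF  # X_0 is the center of the square
    points = [(x, y)]
    for letter in seq:
        vx, vy = VERTEX[letter]
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points


def cgr_closed_form(seq):
    """Return X_n from the closed-form sum over the letters of seq."""
    n = len(seq)
    x = y = HALF / 2**n  # contribution X_0 / 2^n
    for k, letter in enumerate(seq, start=1):
        vx, vy = VERTEX[letter]
        x += Fraction(vx, 2 ** (n - k + 1))
        y += Fraction(vy, 2 ** (n - k + 1))
    return (x, y)
```

For instance, `cgr_points("ATGCGAGTGT")[-1]` and `cgr_closed_form("ATGCGAGTGT")` give the same point, which is the last point of the trajectory for the ten-letter word used in the text.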

With each deterministic word $w = u_1 \dots u_n$, we associate the half-open subsquare $S_w$ defined
by the formula

$$S_w \overset{\text{def}}{=} \sum_{k=1}^{n} \frac{\ell_{u_k}}{2^{n-k+1}} + \left[0, \frac{1}{2^n}\right[^2;$$

it has center $\sum_{k=1}^{n} \ell_{u_k}/2^{n-k+1} + X_0/2^n$ and side $1/2^n$. For a given random or
deterministic sequence $U_1 \dots U_n \dots$, for any word $w$ and any $n \ge |w|$ (the notation $|w|$
stands for the number of letters in $w$), counting the number of points $(X_i)_{|w| \le i \le n}$ that
belong to the subsquare $S_w$ is tantamount to counting the number of occurrences of $w$ as a subword
of $U_1 \dots U_n$. Indeed, all successive words from the sequence having $w$ as a suffix are
represented in $S_w$. See Figure 1 for an example with three-letter subwords. This provides tables of
word frequencies (see Goldman). One can generalize it to any subdivision of the unit square; when the
number of subsquares is not a power of 4, the table of word frequencies defines a counting of words
with noninteger length (see Almeida et al. [2]).
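The correspondence between subsquares and subwords can be checked numerically. The sketch below, in which all function names are illustrative assumptions, counts the CGR points $X_i$ with $i \ge |w|$ that fall in the half-open subsquare $S_w$ and compares this with a direct count of the occurrences of $w$; points with $i < |w|$ are left out because they do not encode a full word of length $|w|$.

```python
from fractions import Fraction

VERTEX = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
HALF = Fraction(1, 2)


def cgr_points(seq):
    """Return [X_0, ..., X_n] for Jeffrey's representation (theta = 1/2)."""
    x, y = HALF, HALF
    points = [(x, y)]
    for letter in seq:
        vx, vy = VERTEX[letter]
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points


def subsquare(word):
    """Lower-left corner and side of the half-open subsquare S_w."""
    n = len(word)
    cx = sum(Fraction(VERTEX[c][0], 2 ** (n - k + 1)) for k, c in enumerate(word, 1))
    cy = sum(Fraction(VERTEX[c][1], 2 ** (n - k + 1)) for k, c in enumerate(word, 1))
    return cx, cy, Fraction(1, 2**n)


def count_points_in_subsquare(seq, word):
    """Number of points X_i, |w| <= i <= n, lying in S_w."""
    points = cgr_points(seq)
    cx, cy, side = subsquare(word)
    return sum(
        1
        for i in range(len(word), len(seq) + 1)
        if cx <= points[i][0] < cx + side and cy <= points[i][1] < cy + side
    )


def count_occurrences(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    m = len(word)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == word)
```

Both counts agree for every word, which is how a table of word frequencies can be read off the CGR picture.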

The following property of the CGR is important: the value of any $X_n$ contains the historical
information of the whole sequence $X_1, \dots, X_n$. Indeed, notice first that, by construction,
$X_n \in S_u$ with $U_n = u$; the whole sequence is now given by the inductive formula
$X_{n-1} = 2X_n - \ell_{U_n}$.
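The inversion property can be made concrete with a short sketch. Assuming Jeffrey's representation and exact rational coordinates, the hypothetical helper `recover_sequence` reads the last letter from the quadrant containing the current point and then steps back with $X_{n-1} = 2X_n - \ell_{U_n}$.

```python
from fractions import Fraction

VERTEX = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
LETTER = {vertex: letter for letter, vertex in VERTEX.items()}
HALF = Fraction(1, 2)


def cgr_last_point(seq):
    """Return X_n for the sequence seq, starting from the center of the square."""
    x, y = HALF, HALF
    for letter in seq:
        vx, vy = VERTEX[letter]
        x, y = (x + vx) / 2, (y + vy) / 2
    return (x, y)


def recover_sequence(point, n):
    """Recover U_1 ... U_n from X_n alone by iterating X_{k-1} = 2 X_k - l_{U_k}."""
    x, y = point
    letters = []
    for _ in range(n):
        # The quadrant containing X_k identifies the vertex of the letter U_k.
        vertex = (int(x >= HALF), int(y >= HALF))
        letters.append(LETTER[vertex])
        x, y = 2 * x - vertex[0], 2 * y - vertex[1]
    return "".join(reversed(letters))
```

With floating-point coordinates the same idea works only for moderate $n$, since each step back doubles the rounding error; this is why the sketch uses `Fraction`.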

Figure 1. Chaos Game Representation of the first 10 nucleotides of the E. coli threonine gene thrA:
ATGCGAGTGT. The coordinates for each nucleotide are calculated recursively using (0.5, 0.5) as
starting position. The sequence is read from left to right. Point number 3 corresponds to the first
3-letter word ATG. It is located in the corresponding quadrant. The second 3-letter word TGC
corresponds to point 4 and so on.

We define a representation of a random DNA sequence $U = (U_n)_{n \ge 1}$ as a random quaternary
tree, the CGR-tree, in which one can visualize repetitions of subwords. We adopt the classical order
(A, C, G, T) on letters. Let $\mathcal{T}$ be the complete infinite 4-ary tree; each node of
$\mathcal{T}$ has four branches corresponding to letters (A, C, G, T) that are ordered in the same
way. The CGR-tree of $U$ is an increasing sequence
$\mathcal{T}_1 \subset \mathcal{T}_2 \subset \dots \subset \mathcal{T}_n \subset \dots$ of finite
subtrees of $\mathcal{T}$, each $\mathcal{T}_n$ having $n$ nodes. The $\mathcal{T}_n$'s are built by
successively inserting the reversed prefixes

$$W(n) = U_n \dots U_1 \qquad (1)$$

as follows in the complete infinite tree. The first letter $W(1) = U_1$ is inserted in the complete
infinite tree at level 1, i.e. just under the root, at the node that corresponds to the letter $U_1$.
Inductively, the insertion of the word $W(n) = U_n \dots U_1$ is made as follows: try to insert it at
level 1 at the node $N$ that corresponds to the letter $U_n$. If this node $N$ is vacant, insert
$W(n)$ at $N$; if $N$ is not vacant, try to insert $W(n)$ in the subtree having $N$ as a root, at the
node that corresponds to the letter $U_{n-1}$, and so on. One repeats this operation until the node
at level $k$ that corresponds to letter $U_{n-k+1}$ is vacant; word $W(n)$ is then inserted at that
node.
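The insertion procedure just described can be sketched in a few lines of Python. The representation of the tree as nested dictionaries and the name `build_cgr_tree` are choices made for this illustration only; the function also records the level at which each reversed prefix $W(n)$ is inserted.

```python
def build_cgr_tree(seq):
    """Insert the reversed prefixes W(n) = U_n ... U_1 one after the other.

    Returns the tree as nested dicts (one key per occupied child) and the
    list of levels at which W(1), ..., W(n) were inserted.
    """
    root = {}
    depths = []
    for n in range(1, len(seq) + 1):
        node = root
        depth = 0
        # Read W(n): letters U_n, U_{n-1}, ..., U_1.
        for i in range(n - 1, -1, -1):
            depth += 1
            letter = seq[i]
            if letter not in node:
                node[letter] = {}  # vacant node: W(n) is inserted here
                break
            node = node[letter]  # occupied: go down one level
        depths.append(depth)
    return root, depths
```

A vacant node is always found before $W(n)$ is exhausted, because the tree holds only $n-1$ nodes when the $n$-letter word $W(n)$ is inserted. Each insertion costs a number of steps equal to the insertion level, so the whole construction is fast for long sequences.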
We complete our construction by labelling the $n$-th inserted node with the word $W(n)$. One readily
obtains this way the process of a digital search tree (DST), as stated in the following proposition.

Figure 2 shows the very first steps of construction of the tree that corresponds to any sequence that
begins with GAGCACAGTGGAAGGG. The insertion of this complete 16-letter prefix is represented in
Figure 3. In these figures, each node has been labelled by its order of insertion to make the example
more readable.

Proposition 1.1. The CGR-tree of a random sequence $U = U_1U_2 \dots$ is a digital search tree,
obtained by insertion in a quaternary tree of the successive reversed prefixes $U_1$, $U_2U_1$,
$U_3U_2U_1$, ... of the sequence.

The main results of our paper are the following convergence results, the random sequence $U$ being
supposed to be Markovian. If $\ell_n$ and $\mathcal{L}_n$ denote respectively the length of the
shortest and of the longest branch of the CGR-tree, then $\ell_n/\ln n$ and $\mathcal{L}_n/\ln n$
converge almost surely to some constants (Theorem 3.1). Moreover, if $D_n$ denotes the insertion
depth and if $M_n$ is the length of a uniformly chosen random path, then $D_n/\ln n$ and
$M_n/\ln n$ converge in probability to a common constant (Theorem 4.1).

Remark 1.2. A given CGR-tree without its labels (i.e. a given shape of tree) is equivalent to a list
of words in the sequence without their order. More precisely, one can associate with a shape of
CGR-tree a representation in the unit square as described below. With any node of the tree (which is
in bijection with a word $w = W_1 \dots W_d$), we associate the center of the corresponding square
$S_w$,

$$\sum_{k=1}^{d} \frac{\ell_{W_k}}{2^{d-k+1}} + \frac{X_0}{2^d}.$$

For example, Figure 3 shows this "historyless representation" for the word GAGCACAGTGGAAGGG.
Moreover, Figure 4 enables us to qualitatively compare the original and the historyless
representations on an example.

Several results are known (see chap. 6 in Mahmoud) concerning the height, the insertion depth and the
profile of DSTs obtained from independent successive sequences having the same distribution. This is
far from our situation, where the successively inserted words are strongly dependent on each other.
Various results concerning the so-called Bernoulli model (binary trees, independent sequences, and
the two letters having the same probability 1/2 of appearance) can be found in Mahmoud. Aldous and
Shields [1] prove, by embedding in continuous time, that the height satisfies
$H_n - \log_2 n \to 0$ in probability. Also, Drmota proves that the height of such DSTs is
concentrated: $\mathbb{E}[H_n - \mathbb{E}(H_n)]^L$ is asymptotically bounded for any $L > 0$.

For DSTs constructed from independent sequences on an $m$-letter alphabet with nonsymmetric (i.e.
non-equal probabilities on the letters) i.i.d. or Markovian sources, Pittel obtains several results
on the insertion depth and on the height. Despite the independence of the sequences, Pittel's work
seems to be the closest to ours, and some parts of our proofs are inspired by it.

Some proofs in the sequel use classical results on the distribution of word occurrences in a random
sequence of letters (independent or Markovian sequences). Blom and Thorburn [4] give the generating
function of the first occurrence of a word for i.i.d. sequences, based on a recurrence relation on
the probabilities. This result is extended to Markovian sequences by Robin and Daudin [26]. Several
studies in this domain are based on generating functions, for example Régnier [24], Reinert et al.,
Stefanov and Pakes [29]. Nonetheless, other approaches are considered: one of the more general
techniques is the Markov chain embedding method introduced by Fu and further developed by Fu and
Koutras, and by Koutras. A martingale approach (see Gerber and Li, Li, Williams [30]) is an
alternative to the Markov chain embedding method to solve problems around Penney's game. These two
approaches are compared in Pozdnyakov et al. Whatever method one uses, the distribution of the first
occurrence of a word strongly depends on its overlapping structure. This dependence is at the core of
our proofs.

Figure 2. Insertion of a sequence GAGCACAGTGGAAGGG... in its CGR-tree: first, second, third and
seventh steps.

As a by-product, our results yield asymptotic properties on the length of the longest run, which is a
natural object of study. In i.i.d. and symmetric sequences, Erdős and Révész establish almost sure
results about the growth of the longest run. These results are extended to Markov chains in Samarova
[28], and Gordon et al. [15] show that the probabilistic behaviour of the length of the longest run
is closely approximated by that of the maximum of some i.i.d. exponential random variables.

The paper is organized as follows. In Section 2 we establish the assumptions and notations we use
throughout. Section 3 is devoted to almost sure convergence of the shortest and the longest branches
in CGR-trees. In Section 4 the asymptotic behaviour of the insertion depth is studied. An appendix
deals separately with the domain of definition of the generating function of a certain waiting time
related to the overlapping structure of words.



2. Assumptions and notations 



In all the sequel, the sequence $U = U_1 \dots U_n \dots$ is supposed to be a Markov chain of order
1, with transition matrix $Q$ and invariant measure $p$ as initial distribution.

For any deterministic infinite sequence $s$, let us denote by $s^{(n)}$ the word formed by the $n$
first letters of $s$, that is to say $s^{(n)} = s_1 \dots s_n$, where $s_i$ is the $i$-th letter of
$s$. The measure $p$ is extended to reversed words the following way:

$$p(s^{(n)}) \overset{\text{def}}{=} \mathbb{P}(U_1 = s_n, \dots, U_n = s_1).$$

The need for reversing the word comes from the construction of the CGR-tree, which is based on
reversed sequences (1).
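We define the constants

$$h_+ \overset{\text{def}}{=} \lim_{n \to +\infty} \frac{1}{n} \max\left\{ \ln\left(\frac{1}{p(s^{(n)})}\right),\ p(s^{(n)}) > 0 \right\},$$

$$h_- \overset{\text{def}}{=} \lim_{n \to +\infty} \frac{1}{n} \min\left\{ \ln\left(\frac{1}{p(s^{(n)})}\right),\ p(s^{(n)}) > 0 \right\},$$

$$h \overset{\text{def}}{=} \lim_{n \to +\infty} \frac{1}{n}\, \mathbb{E}\left[ \ln\left(\frac{1}{p(U^{(n)})}\right) \right].$$

Due to an argument of sub-additivity (see Pittel [22]), these limits are well defined (in fact, in a
framework more general than Markovian sequences). Moreover, Pittel proves the existence of two
infinite sequences denoted here by $s_+$ and $s_-$ such that

$$h_+ = \lim_{n\to\infty} \frac{1}{n} \ln\left(\frac{1}{p(s_+^{(n)})}\right) \quad\text{and}\quad h_- = \lim_{n\to\infty} \frac{1}{n} \ln\left(\frac{1}{p(s_-^{(n)})}\right). \qquad (2)$$

For any $n \ge 1$, the notation $\mathcal{T}_n = \mathcal{T}_n(W)$ stands for the finite tree with
$n$ nodes (without counting the root), built from the first $n$ sequences $W(1), \dots, W(n)$, which
are the successive reversed prefixes of the sequence $(U_n)_n$, as defined by (1). $\mathcal{T}_0$
denotes the tree reduced to the root. In particular, the random trees are increasing:
$\mathcal{T}_0 \subset \mathcal{T}_1 \subset \dots \subset \mathcal{T}_n \subset \dots \subset \mathcal{T}$.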
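To get a feeling for the constants $h_+$, $h_-$ and $h$, consider the special case of i.i.d. letters, which is an assumption made for this example only (the text allows a Markov chain). In that case $p(s^{(n)})$ is a product of letter probabilities, so that $h_+ = \ln(1/p_{\min})$, $h_- = \ln(1/p_{\max})$ and $h$ is the Shannon entropy of the letter distribution in nats. The helper name `entropy_constants` is hypothetical.

```python
import math


def entropy_constants(probs):
    """Return (h_plus, h_minus, h) for i.i.d. letters with the given probabilities.

    probs maps each letter to its probability; letters of probability zero
    are ignored, in line with the condition p(s^(n)) > 0.
    """
    positive = [p for p in probs.values() if p > 0]
    h_plus = math.log(1 / min(positive))   # rarest letter repeated
    h_minus = math.log(1 / max(positive))  # most frequent letter repeated
    h = sum(p * math.log(1 / p) for p in positive)  # entropy
    return h_plus, h_minus, h
```

For the uniform distribution on four letters the three constants coincide and equal $\ln 4$; for any other distribution one gets $h_- < h < h_+$.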
Let us define $\ell_n$ (resp. $\mathcal{L}_n$) as the length of the shortest (resp. the longest) path
from the root to a feasible external node of the tree $\mathcal{T}_n(W)$. Moreover, $D_n$ denotes the
insertion depth of $W(n)$ in $\mathcal{T}_{n-1}$ to build $\mathcal{T}_n$. Finally, $M_n$ is the
length of a path of $\mathcal{T}_n$, randomly and uniformly chosen among the $n$ possible paths.

The following random variables play a key role in the proofs. For the sake of precision, let us
recall that $s$ is deterministic; the randomness is uniquely due to the generation of the sequence
$U$. First we define, for any infinite sequence $s$ and for any $n \ge 0$,

$$X_n(s) \overset{\text{def}}{=} \begin{cases} 0 & \text{if } s_1 \text{ is not in } \mathcal{T}_n, \\ \max\{k \text{ such that } s^{(k)} \text{ is already inserted in } \mathcal{T}_n\} & \text{otherwise.} \end{cases} \qquad (3)$$

Notice that $X_0(s) = 0$. Every infinite sequence corresponds to a branch of the infinite tree
$\mathcal{T}$ (root at level 0, node that corresponds to $s_1$ at level 1, node that corresponds to
$s_2$ at level 2, etc.); the random variable $X_n(s)$ is the length of the branch associated with $s$
in the tree $\mathcal{T}_n$. For any $k \ge 0$, $T_k(s)$ denotes the size of the first tree where
$s^{(k)}$ is inserted:

$$T_k(s) \overset{\text{def}}{=} \min\{n,\ X_n(s) = k\}$$

(notice that $T_0(s) = 0$).

These two variables are in duality in the following sense: one has equality of the events

$$\{X_n(s) \ge k\} = \{T_k(s) \le n\} \qquad (4)$$

and consequently, $\{T_k(s) = n\} \subset \{X_n(s) = k\}$ since $X_n(s) - X_{n-1}(s) \in \{0, 1\}$.

In our example of Figures 2 and 3, the drawn random sequence is GAGCACAGTGGAAGGG... If one takes a
deterministic sequence $s$ such that $s^{(3)} = ACA$, then $X_0(s) = X_1(s) = 0$,
$X_2(s) = X_3(s) = X_4(s) = 1$, $X_5(s) = X_6(s) = 2$ and $X_k(s) = 3$ for $7 \le k \le 18$. The
first three values of $T_k(s)$ are consequently $T_1(s) = 2$, $T_2(s) = 5$, $T_3(s) = 7$.
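The example can be replayed with a small sketch. Here the tree is stored as the set of its node labels (each node being identified with the word read from the root), and the hypothetical functions `branch_lengths` and `insertion_times` return the values $X_0(s), \dots, X_n(s)$ and $T_1(s), \dots, T_k(s)$ for a given finite prefix of $s$; the computed branch length is therefore capped at the length of that prefix.

```python
def branch_lengths(seq, prefix):
    """Return [X_0(s), ..., X_n(s)] for any s starting with the given prefix."""
    nodes = set()  # node labels of the current tree T_n
    xs = [0]
    for n in range(1, len(seq) + 1):
        # Insert W(n) = U_n ... U_1 at the first vacant node on its path.
        word = ""
        for i in range(n - 1, -1, -1):
            word += seq[i]
            if word not in nodes:
                nodes.add(word)
                break
        # X_n(s): largest k such that s^(k) is a node of T_n.
        k = 0
        while k < len(prefix) and prefix[:k + 1] in nodes:
            k += 1
        xs.append(k)
    return xs


def insertion_times(xs, kmax):
    """Return [T_1(s), ..., T_kmax(s)], with T_k(s) = min{n : X_n(s) = k}."""
    return [xs.index(k) for k in range(1, kmax + 1)]
```

With `xs = branch_lengths("GAGCACAGTGGAAGGG", "ACA")`, the call `insertion_times(xs, 3)` returns `[2, 5, 7]`, in agreement with the values of $T_1(s)$, $T_2(s)$, $T_3(s)$ given in the text, and the duality (4) can be verified entry by entry.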

Moreover, the random variable $T_k(s)$ can be decomposed as follows,

$$T_k(s) = \sum_{r=1}^{k} Z_r(s), \qquad (5)$$

where $Z_r(s) \overset{\text{def}}{=} T_r(s) - T_{r-1}(s)$ is the number of letters to read before the
branch that corresponds to $s$ increases by 1. In what follows, $Z_r(s)$ can be viewed as the waiting
time $n$ of the first occurrence of $s^{(r)}$ in the sequence

$$\dots U_{n+T_{r-1}(s)}\, U_{n-1+T_{r-1}(s)} \dots U_{1+T_{r-1}(s)}\, s^{(r-1)},$$

i.e. $Z_r(s)$ can also be defined as

$$Z_r(s) = \min\{n \ge 1,\ U_{n+T_{r-1}(s)} \dots U_{n+T_{r-1}(s)-r+1} = s_1 \dots s_r\}.$$

Because of the Markovianity of the model, the random variables $Z_r(s)$ are independent.

Let us then introduce $Y_r(s)$ as being the waiting time of the first occurrence of $s^{(r)}$ in the
sequence

$$\dots U_{n+T_{r-1}(s)}\, U_{n-1+T_{r-1}(s)} \dots U_{1+T_{r-1}(s)},$$

that is to say

$$Y_r(s) = \min\{n \ge r,\ U_{n+T_{r-1}(s)} \dots U_{n+T_{r-1}(s)-r+1} = s_1 \dots s_r\}.$$

One has readily the inequality $Z_r(s) \le Y_r(s)$. More precisely, if the word $s^{(r)}$ is inserted
in the sequence before time $T_{r-1}(s) + r$, there is some overlapping between prefixes of
$s^{(r-1)}$ and suffixes of $s^{(r)}$. See Figure 5 for an example where $r = 6$ and
$s_1s_2s_3 = s_4s_5s_6$. Actually, variables $Z_r(s)$ and $Y_r(s)$ are related by

$$Z_r(s) = \mathbb{1}_{\{Z_r(s) < r\}}\, Z_r(s) + \mathbb{1}_{\{Z_r(s) \ge r\}}\, Y_r(s).$$

Since the sequence $(U_n)_{n \ge 1}$ is stationary, the conditional distribution of $Y_r(s)$ given
$T_{r-1}(s)$ is the distribution of the first occurrence of the word $s^{(r)}$ in the realization of
a Markov chain of order 1, whose transition matrix is $Q$ and whose initial distribution is its
invariant measure. In particular the conditional distribution of $Y_r(s)$ given $T_{r-1}(s)$ is
independent of $T_{r-1}(s)$.

The generating function $\Phi(s^{(r)}, t) \overset{\text{def}}{=} \mathbb{E}[t^{Y_r(s)}]$ is given by
Robin and Daudin [26]:

$$\Phi(s^{(r)}, t) = \left( \gamma_r(t) + (1-t)\,\delta_r(t^{-1}) \right)^{-1}, \qquad (6)$$

where the functions $\gamma$ and $\delta$ are respectively defined as

$$\gamma_r(t) \overset{\text{def}}{=} \frac{1-t}{t\,p(s_r)} \sum_{m \ge 1} Q^m(s_1, s_r)\, t^m, \qquad \delta_r(t^{-1}) \overset{\text{def}}{=} \sum_{m=1}^{r} \frac{\mathbb{1}_{\{s_r \dots s_{r-m+1} = s_m \dots s_1\}}}{t^m\, p(s^{(m)})}, \qquad (7)$$

and where $Q^m(u,v)$ denotes the transition probability from $u$ to $v$ in $m$ steps.

Remark 2.1. In the particular case when the sequence of nucleotides $(U_n)_{n \ge 1}$ is supposed to
be independent and identically distributed according to the nondegenerate law
$(p_A, p_C, p_G, p_T)$, the transition probability $Q^m(s_1, s_r)$ is equal to $p(s_r)$, and hence
$\gamma_r(t) = 1$.
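In the i.i.d. case of Remark 2.1, formula (6) reduces to $\Phi(s^{(r)}, t) = (1 + (1-t)\,\delta_r(t^{-1}))^{-1}$, and differentiating at $t = 1$ gives the classical expression of the mean waiting time as a sum of $1/p(s^{(m)})$ over the overlap lengths $m$. The sketch below illustrates this under the i.i.d. assumption only; the function names are ours, and the word is taken in the order in which it is read in the sequence (for i.i.d. letters the reversal used in the text does not change the result).

```python
def overlap_lengths(word):
    """Lengths m such that the suffix of length m equals the prefix of length m."""
    r = len(word)
    return [m for m in range(1, r + 1) if word[r - m:] == word[:m]]


def word_probability(word, probs):
    """Probability of a word under i.i.d. letters with distribution probs."""
    p = 1.0
    for letter in word:
        p *= probs[letter]
    return p


def waiting_time_pgf(word, probs, t):
    """E[t^Y] for the first occurrence time Y of word; gamma_r(t) = 1 here."""
    delta = sum(
        1 / (t**m * word_probability(word[:m], probs))
        for m in overlap_lengths(word)
    )
    return 1 / (1 + (1 - t) * delta)


def expected_waiting_time(word, probs):
    """E[Y], i.e. the derivative of the generating function at t = 1."""
    return sum(1 / word_probability(word[:m], probs) for m in overlap_lengths(word))
```

For uniform nucleotides, `expected_waiting_time("AA", probs)` is 20 whereas `expected_waiting_time("AC", probs)` is 16: the self-overlapping word takes longer to appear for the first time, which is the dependence on the overlapping structure used in the proofs.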

Proposition 2.2. (i) The generating function of $Y_r(s)$ defined by (6) has a radius of convergence
$\ge 1 + \kappa\, p(s^{(r)})$, where $\kappa$ is a positive constant independent of $r$ and $s$.
(ii) Let $\gamma$ denote the second largest eigenvalue of the transition matrix $Q$. For all
$t \in\, ]-\gamma^{-1}, \gamma^{-1}[$,

$$|\gamma_r(t) - 1| \le \frac{|1-t|}{1 - \gamma|t|}\, \kappa', \qquad (8)$$

where $\kappa'$ is some positive constant independent of $r$ and $s$ (if $\gamma = 0$ or if the
sequence is i.i.d., we adopt the convention $\gamma^{-1} = +\infty$ so that the result remains valid).

Proof. The proof of Proposition 2.2 is given in the Appendix. □

3. Length of the branches 

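In this section we are concerned with the asymptotic behaviour of the length $\ell_n$ (resp.
$\mathcal{L}_n$) of the shortest (resp. longest) branch of the CGR-tree.

Theorem 3.1.

$$\frac{\ell_n}{\ln n} \xrightarrow[n\to\infty]{\text{a.s.}} \frac{1}{h_+} \qquad\text{and}\qquad \frac{\mathcal{L}_n}{\ln n} \xrightarrow[n\to\infty]{\text{a.s.}} \frac{1}{h_-}.$$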






According to the definition of $X_n(s)$, the lengths $\ell_n$ and $\mathcal{L}_n$ are functions of
$X_n$:

$$\ell_n = \min_{s} X_n(s), \quad\text{and}\quad \mathcal{L}_n = \max_{s} X_n(s). \qquad (9)$$

The following key lemma gives an asymptotic result on $X_n(s)$, under suitable assumptions on $s$.
Our proof of Theorem 3.1 is based on it.

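Lemma 3.2. Let $s$ be such that there exists

$$\lim_{n\to+\infty} \frac{1}{n} \ln\left(\frac{1}{p(s^{(n)})}\right) \overset{\text{def}}{=} h(s) > 0. \qquad (10)$$

Then

$$\frac{X_n(s)}{\ln n} \xrightarrow[n\to\infty]{\text{a.s.}} \frac{1}{h(s)}.$$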



Remark 3.3. Let $\bar v \overset{\text{def}}{=} vv\dots$ consist of repetitions of a letter $v$. Then
$X_n(\bar v)$ is the length of the branch associated with $\bar v$ in $\mathcal{T}_n$. For such
sequences (and exclusively for them) the random variable $Y_k(\bar v)$ is equal to $T_k(\bar v)$.
Consequently $X_n(\bar v)$ is the length of the longest run of $v$ in $U_1 \dots U_n$. When
$(U_n)_{n \ge 1}$ is a sequence of i.i.d. trials, Erdős and Révész, and Petrov showed that

$$\frac{X_n(\bar v)}{\ln n} \xrightarrow[n\to\infty]{\text{a.s.}} \frac{1}{\ln\frac{1}{p}},$$

where $p \overset{\text{def}}{=} \mathbb{P}(U_i = v)$. This convergence result is a particular case
of Lemma 3.2.
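The longest-run statement is easy to explore numerically. The following sketch, given under the assumption of i.i.d. letters and with illustrative names, computes the longest run of a letter and the ratio of its length to $\ln n$ on a simulated sequence; the ratio is expected to be close to $1/\ln(1/p)$ only for large $n$, and convergence is slow.

```python
import math
import random


def longest_run(seq, letter):
    """Length of the longest run of `letter` in seq."""
    best = current = 0
    for c in seq:
        current = current + 1 if c == letter else 0
        best = max(best, current)
    return best


def run_ratio(n, p, seed=0):
    """Longest run of a letter of probability p in n i.i.d. trials, over ln n."""
    rng = random.Random(seed)
    # "v" is the letter of interest, "x" stands for any other letter.
    seq = ["v" if rng.random() < p else "x" for _ in range(n)]
    return longest_run(seq, "v") / math.log(n)
```

On the 16-letter example GAGCACAGTGGAAGGG, the longest run of G has length 3, which is also the length of the branch GGG... in the corresponding CGR-tree.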

Simulations. In a first set of computations, two random sequences whose letters are i.i.d. were
generated. In Figure 6, in the first graph, letters are drawn with equal probabilities; in the second
one, they are drawn with respective probabilities $(p_A, p_C, p_G, p_T) = (0.4, 0.3, 0.2, 0.1)$. One
can visualize the dynamic convergence of $\ell_n/\ln n$, $\mathcal{L}_n/\ln n$ and of the normalized
insertion depth $D_n/\ln n$ (see Section 4) to their respective constant limits.

Figure 7 is made from simulations of 2,000 random sequences of length 100,000 with i.i.d. letters
under the distribution $(p_A, p_C, p_G, p_T) = (0.6, 0.1, 0.1, 0.2)$. On the x-axis, respectively:
lengths of the shortest branches, insertion depth of the last inserted word, lengths of the longest
branches. On the y-axis: number of occurrences (histograms).
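A simulation of this kind can be sketched as follows. This is not the code used for the figures; it is a minimal illustration assuming i.i.d. letters with positive probabilities, so that every external node is feasible. The function `branch_statistics` (a name chosen here) returns the shortest branch $\ell_n$, the longest branch $\mathcal{L}_n$ and the insertion depth $D_n$ of the last inserted word.

```python
import random

ALPHABET = "ACGT"


def branch_statistics(seq):
    """Return (shortest branch, longest branch, last insertion depth)."""
    root = {}
    last_depth = 0
    for n in range(1, len(seq) + 1):
        node, depth = root, 0
        # Insert W(n) = U_n ... U_1 at the first vacant node on its path.
        for i in range(n - 1, -1, -1):
            depth += 1
            child = node.get(seq[i])
            if child is None:
                node[seq[i]] = {}
                break
            node = child
        last_depth = depth
    shortest, longest = None, 0
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        longest = max(longest, depth)
        # A node with a missing child ends some branch of length `depth`.
        if len(node) < len(ALPHABET):
            shortest = depth if shortest is None else min(shortest, depth)
        stack.extend((child, depth + 1) for child in node.values())
    return shortest, longest, last_depth


def simulate(n, weights, seed=0):
    """Draw n i.i.d. letters with the given weights and return the statistics."""
    rng = random.Random(seed)
    seq = rng.choices(ALPHABET, weights=weights, k=n)
    return branch_statistics(seq)
```

Dividing the three returned values by $\ln n$ for increasing $n$ gives an empirical view of the convergence discussed above; repeating `simulate` with different seeds gives histograms comparable in spirit to those of Figure 7.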

Proof of Lemma 3.2. Since $X_n(s) = k$ for $n = T_k(s)$ (see Equation (4)), by monotonicity
arguments, it is sufficient to prove that

$$\frac{\ln T_k(s)}{k} \xrightarrow[k\to\infty]{\text{a.s.}} h(s).$$

Let $\varepsilon_r(s) \overset{\text{def}}{=} Z_r(s) - \mathbb{E}[Z_r(s)]$, so that $T_k(s)$ admits
the decomposition

$$T_k(s) = \mathbb{E}[T_k(s)] + \sum_{r=1}^{k} \varepsilon_r(s).$$

If $(M_k(s))_k$ is the martingale defined by

$$M_k(s) \overset{\text{def}}{=} \sum_{r=1}^{k} \varepsilon_r(s),$$

taking the logarithm in the preceding equation leads to

$$\ln T_k(s) = \ln \mathbb{E}[T_k(s)] + \ln\left(1 + \frac{M_k(s)}{\mathbb{E}[T_k(s)]}\right). \qquad (11)$$

• It is shown in Robin and Daudin [26] that $\mathbb{E}[Z_n(s)] = 1/p(s^{(n)})$, so that the
sequence $\frac{1}{n}\ln\mathbb{E}[Z_n(s)]$ converges to $h(s)$ as $n$ tends to infinity ($h(s)$ is
defined by (10)). Since $\mathbb{E}[T_k(s)] = \sum_{r=1}^{k}\mathbb{E}[Z_r(s)]$ (see (5)), the
equality

$$\lim_{k\to\infty} \frac{1}{k}\ln\mathbb{E}[T_k(s)] = h(s)$$

is a straightforward consequence of the following elementary result: if $(x_k)_k$ is a sequence of
positive numbers such that $\lim_{k\to\infty}\frac{1}{k}\ln(x_k) = h > 0$, then
$\lim_{k\to\infty}\frac{1}{k}\ln\left(\sum_{r=1}^{k} x_r\right) = h$.

• The martingale $(M_k(s))_k$ is square integrable; its increasing process is denoted by
$(\langle M(s)\rangle_k)_k$. Robin and Daudin [26] have shown that the variance of $Z_r(s)$ satisfies
$\mathbb{V}[Z_r(s)] \le 4r/p(s^{(r)})^2$, so that

$$\langle M(s)\rangle_k = O\left(k\,e^{2kh(s)}\right).$$

One can thus apply the Law of Large Numbers for martingales (see Duffo [8] for a reference on the
subject): for any $\alpha > 0$,

$$M_k(s) = o\left(\langle M(s)\rangle_k^{1/2}\,\big(\ln\langle M(s)\rangle_k\big)^{(1+\alpha)/2}\right) \quad\text{a.s.}$$

Consequently,

$$\frac{M_k(s)}{\mathbb{E}[T_k(s)]} = O\left(k^{1+\alpha/2}\right) \quad\text{a.s.},$$

which completes the proof of Lemma 3.2. □



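Proof of Theorem 3.1. It is inspired from Pittel [22]. Clearly the definition given in Equation (9)
yields

$$\ell_n \le X_n(s_+) \quad\text{and}\quad \mathcal{L}_n \ge X_n(s_-)$$

(definitions of $s_+$ and $s_-$ were given in (2)). Hence, by Lemma 3.2,

$$\limsup_{n\to\infty}\frac{\ell_n}{\ln n} \le \frac{1}{h_+}, \qquad \liminf_{n\to\infty}\frac{\mathcal{L}_n}{\ln n} \ge \frac{1}{h_-} \quad\text{a.s.}$$

• Proof for $\ell_n$

For any integer $r$,

$$\mathbb{P}(\ell_n \le r-1) \le \sum_{s^{(r)}\in\mathcal{A}^r} \mathbb{P}(X_n(s) \le r-1) \le \sum_{s^{(r)}\in\mathcal{A}^r} \mathbb{P}(T_r(s) > n), \qquad (12)$$

where the above sums are taken over the set $\mathcal{A}^r$ of words with length $r$ (for a proper
meaning of this formula, one should replace $s$ by any infinite word having $s^{(r)}$ as prefix, in
both occurrences). We use this abuse of notation from now on. Since the generating functions
$\Phi(s^{(j)}, t)$ are defined for any $1 \le t < \min\{\gamma^{-1}, 1 + \kappa p(s^{(r)})\}$ and
$j \le r$ (see Assertion i) in Proposition 2.2), each term of the sum (12) can be controlled by

$$\mathbb{P}(T_r(s) > n) \le t^{-n}\,\mathbb{E}[t^{T_r(s)}] \le t^{-n}\prod_{j=1}^{r}\Phi(s^{(j)}, t).$$

In particular, bounding above all the overlapping functions
$\mathbb{1}_{\{s_j\dots s_1 = s_r\dots s_{r-j+1}\}}$ by 1 in (7), we deduce from (6) and from
Assertion ii) of Proposition 2.2 that

$$\mathbb{P}(T_r(s) > n) \le t^{-n}\prod_{j=1}^{r}\left(1 + (1-t)\left(\sum_{\nu=1}^{j}\frac{1}{t^\nu p(s^{(\nu)})} + \frac{\kappa'}{1-\gamma t}\right)\right)^{-1}.$$

Let $0 < \varepsilon < 1$. There exists a constant $c_2 \in\,]0,1[$ depending only on $\varepsilon$
such that

$$p(s^{(r)}) \ge c_2\,\alpha^r, \quad\text{with } \alpha \overset{\text{def}}{=} \exp(-(1+\varepsilon^2)h_+)$$

(for the sake of brevity $c$, $c_1$ and $c_2$ denote different constants all along the text). We then
have

$$\mathbb{P}(T_r(s) > n) \le t^{-n}\prod_{j=1}^{r}\left(1 + (1-t)\left(\sum_{\nu=1}^{j}\frac{1}{c_2\,\alpha^\nu t^\nu} + \frac{\kappa'}{1-\gamma t}\right)\right)^{-1}.$$

Choosing $t = 1 + c_2\kappa\alpha^r$, Inequality (8) is valid if $r$ is large enough, so that

$$\mathbb{P}(T_r(s) > n) \le t^{-n}\prod_{j=1}^{r}\left(1 - \kappa\alpha^{r-j}\,\frac{\alpha^j - (1+c_2\kappa\alpha^r)^{-j}}{\alpha(1+c_2\kappa\alpha^r) - 1} - \frac{\alpha^r c_2\kappa\kappa'}{1-\gamma(1+c_2\kappa\alpha^r)}\right)^{-1}.$$

Moreover, since obviously

$$\lim_{r\to\infty}\frac{\alpha^j - (1+c_2\kappa\alpha^r)^{-j}}{\alpha(1+c_2\kappa\alpha^r)-1} = \frac{1-\alpha^j}{1-\alpha}$$

and $c_2\kappa\kappa'/(1-\gamma(1+c_2\kappa\alpha^r))$ is uniformly bounded in $r$, there exist two
positive constants $\lambda$ and $L$ independent of $j$ and $r$ such that

$$\mathbb{P}(T_r(s) > n) \le (1+c_2\kappa\alpha^r)^{-n}\,L\prod_{j=1}^{r}\frac{1}{1-\lambda\alpha^{r-j}}.$$

In addition, the product can be bounded above by

$$\prod_{j=1}^{r}\frac{1}{1-\lambda\alpha^{r-j}} \le \prod_{j=0}^{\infty}\frac{1}{1-\lambda\alpha^{j}} = R < \infty.$$

Consequently,

$$\mathbb{P}(T_r(s) > n) \le LR\,(1+c_2\kappa\alpha^r)^{-n}.$$

For $r = \lfloor (1-\varepsilon)\frac{\ln n}{h_+} \rfloor$ and $\varepsilon$ small enough, there
exists a constant $R'$ such that

$$\mathbb{P}(T_r(s) > n) \le R'\exp(-c_2\kappa n^{\theta}),$$

where $\theta = \varepsilon - \varepsilon^2 + \varepsilon^3 > 0$. We then deduce from (12) that

$$\mathbb{P}(\ell_n \le r-1) \le 4^r R'\exp(-c_2\kappa n^{\theta}),$$

which is the general term of a convergent series. Borel-Cantelli Lemma applies so that

$$\liminf_{n\to\infty}\frac{\ell_n}{\ln n} \ge \frac{1}{h_+} \quad\text{a.s.}$$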

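• Proof for $\mathcal{L}_n$

To complete the proof, one needs to show that

$$\limsup_{n\to\infty}\frac{\mathcal{L}_n}{\ln n} \le \frac{1}{h_-} \quad\text{a.s.}$$

Again, since $X_n(s) = k$ for $n = T_k(s)$, by monotonicity arguments it suffices to show that

$$\liminf_{k\to\infty}\ \min_{s^{(k)}\in\mathcal{A}^k}\frac{\ln T_k(s)}{k} \ge h_- \quad\text{a.s.}$$

(notations of (12)).

Let $0 < \varepsilon < 1$. As in the previous proof for the shortest branches, it suffices to bound
above

$$\mathbb{P}\left(\min_{s^{(k)}\in\mathcal{A}^k} T_k(s) < e^{kh_-(1-\varepsilon)}\right)$$

by the general term of a convergent series to apply Borel-Cantelli Lemma. Obviously,

$$\mathbb{P}\left(\min_{s^{(k)}\in\mathcal{A}^k} T_k(s) < e^{kh_-(1-\varepsilon)}\right) \le \sum_{s^{(k)}\in\mathcal{A}^k}\mathbb{P}\left(T_k(s) < e^{kh_-(1-\varepsilon)}\right).$$

If $t$ is any real number in $]0,1[$ and if $n \overset{\text{def}}{=} \exp(kh_-(1-\varepsilon))$,

$$\mathbb{P}\left(T_k(s) < e^{kh_-(1-\varepsilon)}\right) = \mathbb{P}\left(t^{T_k(s)} > t^n\right)$$

and the decomposition (5), together with the independence of the $Z_r(s)$ for $1 \le r \le k$, yield

$$\mathbb{P}\left(t^{T_k(s)} > t^n\right) \le t^{-n}\prod_{r=1}^{k}\mathbb{E}[t^{Z_r(s)}].$$

The proof consists in bounding above

$$\sum_{s^{(k)}\in\mathcal{A}^k} t^{-n}\prod_{r=1}^{k}\mathbb{E}[t^{Z_r(s)}] \qquad (13)$$

by the general term of a convergent series, taking $t$ of the form

$$t \overset{\text{def}}{=} (1+c/n)^{-1}$$

so that the sequence $(t^{-n})_n$ is bounded.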

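The generating function of $Z_r(s)$ is given by Robin and Daudin [26] and strongly depends on the
overlapping structure of the word $s^{(r)}$. As $0 < t < 1$, this function is well defined at $t$ and
is given by (see Assertion i) of Proposition 2.2)

$$\mathbb{E}[t^{Z_r(s)}] = 1 - \frac{1-t}{t^r p(s^{(r)})\left[\gamma_r(t) + (1-t)\delta_r(t^{-1})\right]}, \qquad (14)$$

where $\gamma_r(t)$ and $\delta_r(t)$ are defined in (7). Moreover, from Assertion ii) of Proposition
2.2 it is obvious that there exists a constant $\theta$ independent of $r$ and $s$ such that

$$\gamma_r(t) \le 1 + \theta(1-t). \qquad (15)$$

Besides, by elementary change of variable, one has successively

$$t^r p(s^{(r)})\,\delta_r(t^{-1}) = t^r p(s^{(r)})\sum_{m=1}^{r}\frac{\mathbb{1}_{\{s_r\dots s_{r-m+1} = s_m\dots s_1\}}}{t^m\,p(s^{(m)})}$$

$$= t^r p(s^{(r)})\sum_{m=1}^{r}\frac{\mathbb{1}_{\{s_r\dots s_m = s_{r-m+1}\dots s_1\}}}{t^{r-m+1}\,p(s^{(r-m+1)})}$$

$$= \sum_{m=1}^{r}\mathbb{1}_{\{s_r\dots s_m = s_{r-m+1}\dots s_1\}}\,t^{m-1}\,\frac{p(s^{(r)})}{p(s^{(r-m+1)})}.$$

When $m$ is large enough, $h_-$'s definition implies that

$$p(s^{(m)}) \le \beta^m, \quad\text{where } \beta \overset{\text{def}}{=} \exp(-(1-\varepsilon^2)h_-),$$

so that there exist positive constants $\rho$ and $c$ such that, for any $r$,

$$t^r p(s^{(r)}) \le c\beta^r \quad\text{and}\quad t^r p(s^{(r)})\,\delta_r(t^{-1}) \le 1 + \rho\sum_{m=2}^{r}\mathbb{1}_{\{s_r\dots s_m = s_{r-m+1}\dots s_1\}}\beta^m. \qquad (16)$$

Thus Formula (14) with inequalities (15) and (16) yield, for any $r \le k$,

$$\mathbb{E}[t^{Z_r(s)}] \le 1 - \frac{1}{c\beta^r\left((1-t)^{-1}+\theta\right) + 1 + q_k(s)}, \qquad (17)$$

where $q_k(s)$, that depends on the overlapping structure of $s^{(k)}$, is defined by

$$q_k(s) \overset{\text{def}}{=} \rho\max_{2\le r\le k}\sum_{m=2}^{r}\mathbb{1}_{\{s_r\dots s_m = s_{r-m+1}\dots s_1\}}\beta^m.$$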
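Note that whatever the overlapping structure is, $q_k(s)$ is controlled by

$$0 \le q_k(s) \le \frac{\rho}{1-\beta}. \qquad (18)$$

Thus,

$$\prod_{r=1}^{k}\mathbb{E}[t^{Z_r(s)}] \le \exp\left(-\sum_{r=1}^{k}\ln\left(1 - \frac{1}{c\beta^r((1-t)^{-1}+\theta)+1+q_k(s)}\right)^{-1}\right).$$

Since the function $x \mapsto \ln 1/(1-x)$ is increasing, comparing this sum with an integral and
after the change of variable $y = c\beta^x((1-t)^{-1}+\theta)$, one obtains

$$\prod_{r=1}^{k}\mathbb{E}[t^{Z_r(s)}] \le \exp\left(-\frac{1}{\ln\beta^{-1}}\int_{c\beta^k((1-t)^{-1}+\theta)}^{c((1-t)^{-1}+\theta)}\ln\left(1-\frac{1}{y+1+q_k(s)}\right)^{-1}\frac{dy}{y}\right).$$

This integral is convergent in a neighbourhood of $+\infty$, hence there exists a constant $C$,
independent of $k$ and $s$, such that

$$\prod_{r=1}^{k}\mathbb{E}[t^{Z_r(s)}] \le C\exp\left[-\frac{1}{\ln\beta^{-1}}\int_{c\beta^k((1-t)^{-1}+\theta)}^{+\infty}\ln\left(1-\frac{1}{y+1+q_k(s)}\right)^{-1}\frac{dy}{y}\right]. \qquad (19)$$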


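The classical dilogarithm $\mathrm{Li}_2(z) = \sum_{k\ge1} z^k/k^2$, analytically continued to the
complex plane slit along the ray $[1,+\infty[$, satisfies
$\frac{d}{dy}\mathrm{Li}_2(-\frac{v}{y}) = \frac{1}{y}\log(1+\frac{v}{y})$. This leads to the formula

$$\int_{a_k}^{+\infty}\ln\left(1-\frac{1}{y+1+q_k(s)}\right)^{-1}\frac{dy}{y} = \mathrm{Li}_2\left(-\frac{q_k(s)}{a_k}\right) - \mathrm{Li}_2\left(-\frac{1+q_k(s)}{a_k}\right)$$

with the notation $a_k = c\beta^k((1-t)^{-1}+\theta)$. Choosing $t = (1+c/n)^{-1}$ yields readily

$$a_k \underset{k\to\infty}{\sim} \exp(-kh_-(\varepsilon-\varepsilon^2)). \qquad (20)$$

Moreover, in a neighbourhood of $-\infty$,

$$\mathrm{Li}_2(x) = -\frac{1}{2}\ln^2(-x) - \zeta(2) + O\left(\frac{1}{x}\right), \qquad (21)$$

and the function $\mathrm{Li}_2(x) + \frac{1}{2}\ln^2(-x)$ is non-decreasing on $]-\infty,0[$, so
that

$$\mathrm{Li}_2(x) \ge -\frac12\ln^2(-x) - \zeta(2) \quad (x<0), \qquad \mathrm{Li}_2(x) \le -\frac12\ln^2(-x) - \frac{\zeta(2)}{2} \quad (x\le-1), \qquad (22)$$

noting that $\mathrm{Li}_2(-1) = -\frac{\zeta(2)}{2}$. Hence, if $k$ is such that $a_k < 1$,

$$\int_{a_k}^{+\infty}\ln\left(1-\frac{1}{y+1+q_k(s)}\right)^{-1}\frac{dy}{y} \ge \mathrm{Li}_2\left(-\frac{q_k(s)}{a_k}\right) + \frac12\ln^2(a_k) + \frac{\zeta(2)}{2} \qquad (23)$$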
with $\ln a_k$ being asymptotically proportional to $k$ because of (20). Thus, the behaviour of the
integral in (19) as $k$ tends to $+\infty$ depends on the asymptotics of $q_k(s)$.

Let $z_k \overset{\text{def}}{=} \exp(-\sqrt{k})$. The end of the proof consists, for a given $k$, in
splitting the sum (13) into prefixes $s^{(k)}$ that respectively satisfy $q_k(s) < \exp(-\sqrt{k})$
or $q_k(s) \ge \exp(-\sqrt{k})$. These two cases correspond to words that respectively have few or
many overlapping patterns. The choice $z_k = \exp(-\sqrt{k})$ is arbitrary and many other sequences
could have been taken provided that they converge to zero with a speed of the form $\exp[-o(k)]$.

First let us consider the case of prefixes $s^{(k)}$ such that $q_k(s) < \exp(-\sqrt{k})$. For such
words, (22) and (23) imply that

$$\int_{a_k}^{+\infty}\ln\left(1-\frac{1}{y+1+q_k(s)}\right)^{-1}\frac{dy}{y} \ge -\frac12\ln^2\left(\frac{z_k}{a_k}\right) + \frac12\ln^2(a_k) - \frac{\zeta(2)}{2},$$

the second member of this inequality being, as $k$ tends to infinity, of the form

$$k\sqrt{k}\,h_-(\varepsilon-\varepsilon^2) + O(k).$$

Consequently,

$$\prod_{r=1}^{k}\mathbb{E}[t^{Z_r(s)}] \le \exp\left[-\frac{\varepsilon}{1+\varepsilon}\,k^{3/2} + O(k)\right].$$

There are $4^k$ words of length $k$, hence very roughly, by taking the sum over the prefixes
$s^{(k)}$ such that $q_k(s) < z_k$, and since $t^{-n}$ is bounded, the contribution of these prefixes
to the sum (13) satisfies

$$\sum_{s^{(k)}\in\mathcal{A}^k,\ q_k(s)<z_k} t^{-n}\prod_{r=1}^{k}\mathbb{E}[t^{Z_r(s)}] \le 4^k\exp\left[-\frac{\varepsilon}{1+\varepsilon}\,k^{3/2} + O(k)\right],$$

which is the general term of a convergent series.

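It remains to study the case $q_k(s) \ge z_k$. For such words, let us only consider the inequalities
(18) and (19) that lead to

$$\prod_{r=1}^{k}\mathbb{E}[t^{Z_r(s)}] \le C\exp\left[-\frac{1}{\ln\beta^{-1}}\int_{a_k}^{+\infty}\ln\left(1-\frac{1}{y+1+\frac{\rho}{1-\beta}}\right)^{-1}\frac{dy}{y}\right].$$

Since $x \le \log(1-x)^{-1}$, after some work of integration,

$$\prod_{r=1}^{k}\mathbb{E}[t^{Z_r(s)}] \le \exp\left(-\frac{\varepsilon}{(1+\varepsilon)\left(1+\frac{\rho}{1-\beta}\right)}\,k + o(k)\right). \qquad (24)$$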



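The natural question arising now is: how many words $s^{(k)}$ are there, such that
$q_k(s) \ge z_k$? Let us define

$$E_k \overset{\text{def}}{=} \{s^{(k)},\ q_k(s) \ge z_k\}.$$

The definition of $q_k(s)$ implies clearly that

$$E_k \subset \left\{s^{(k)},\ \exists r \le k,\ \rho\sum_{m=2}^{r}\mathbb{1}_{\{s_r\dots s_m = s_{r-m+1}\dots s_1\}}\beta^m \ge e^{-\sqrt{k}}\right\}.$$

For any $r \le k$ and $x > 0$, let us define the set

$$S_r(x) \overset{\text{def}}{=} \left\{s^{(k)},\ \rho\sum_{m=2}^{r}\mathbb{1}_{\{s_r\dots s_m = s_{r-m+1}\dots s_1\}}\beta^m < x\right\}.$$

For any $\ell \in \{2,\dots,r\}$, one has the following inclusion

$$\bigcap_{m=2}^{\ell}\left\{s^{(k)},\ \mathbb{1}_{\{s_r\dots s_m = s_{r-m+1}\dots s_1\}} = 0\right\} \subset S_r\left(\frac{\rho\beta^{\ell+1}}{1-\beta}\right).$$

If the notation $B^c$ denotes the complementary set of $B$ in $\mathcal{A}^k$,

$$S_r\left(\frac{\rho\beta^{\ell+1}}{1-\beta}\right)^c \subset \bigcup_{m=2}^{\ell}\left\{s^{(k)},\ \mathbb{1}_{\{s_r\dots s_m = s_{r-m+1}\dots s_1\}} = 1\right\}.$$

Since $z_k = \rho\beta^{\ell+1}(1-\beta)^{-1}$ for
$\ell = \sqrt{k}/\ln(\beta^{-1}) + \ln(\rho^{-1}(1-\beta))/\ln\beta - 1$,

$$E_k \subset \bigcup_{r=1}^{k}\ \bigcup_{m=2}^{\lfloor\ell\rfloor+1}\left\{s^{(k)},\ \mathbb{1}_{\{s_r\dots s_m = s_{r-m+1}\dots s_1\}} = 1\right\},$$

so that the number of words $s^{(k)}$ such that $q_k(s) \ge z_k$ is bounded above by

$$\# E_k \le \sum_{r=1}^{k}\sum_{m=2}^{\lfloor\ell\rfloor+1} 4^{m-1} \in O\left(k\,4^{\sqrt{k}/\ln(\beta^{-1})}\right). \qquad (25)$$

Putting (24) and (25) together is sufficient to show that the contribution of prefixes $s^{(k)}$
such that $q_k(s) \ge z_k$ to the sum (13), namely

$$\sum_{s^{(k)}\in E_k} t^{-n}\prod_{r=1}^{k}\mathbb{E}[t^{Z_r(s)}],$$

is the general term of a convergent series too.

Finally, the whole sum (13) is the general term of a convergent series, which completes the proof of
the inequality

$$\limsup_{n\to\infty}\frac{\mathcal{L}_n}{\ln n} \le \frac{1}{h_-} \quad\text{a.s.}$$

□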




4. Insertion depth 

This section is devoted to the asymptotic behaviour of the insertion depth, denoted by $D_n$, and of
the length of a path chosen uniformly at random, denoted by $M_n$ (see Section 2). $D_n$ is defined as
the length of the path leading to the node where $W(n)$ is inserted. In other words, $D_n$ is the number
of digits to be checked before the position of $W(n)$ is found. Theorem 3.1 immediately implies a first
asymptotic result on $D_n$. Indeed, $D_n = \ell_n$ whenever $\ell_{n+1} > \ell_n$, which happens infinitely often a.s.,
since $\lim_{n\to\infty} \ell_n = \infty$ a.s. Hence,

$$\liminf_{n\to\infty} \frac{D_n}{\ln n} = \liminf_{n\to\infty} \frac{\ell_n}{\ln n} = \frac{1}{h_+} \quad \text{a.s.}$$

Similarly, $D_n = \mathcal{L}_n$ whenever $\mathcal{L}_{n+1} > \mathcal{L}_n$, and hence

$$\limsup_{n\to\infty} \frac{D_n}{\ln n} = \limsup_{n\to\infty} \frac{\mathcal{L}_n}{\ln n} = \frac{1}{h_-} \quad \text{a.s.}$$
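The insertion depth can be explored numerically. The following is a minimal sketch, not the authors' code, assuming that $W(n)$ is the reversed prefix $U_n U_{n-1} \ldots U_1$ and that the root of the tree carries no word, so that the first inserted word sits at depth 1. The function name `insertion_depths` and the nested-dictionary representation of the tree are illustrative choices.

```python
def insertion_depths(seq):
    """Return [D_1, ..., D_n] for the tree built from the reversed prefixes of seq.

    Assumption: W(n) = seq[n-1], seq[n-2], ..., seq[0], and the root is empty.
    """
    root = {}      # each node is a dict mapping a letter to a child node
    depths = []
    for n in range(1, len(seq) + 1):
        node, depth = root, 0
        # Read the letters of W(n) until a vacant child position is met.
        for i in range(n - 1, -1, -1):
            depth += 1
            if seq[i] not in node:
                node[seq[i]] = {}   # W(n) is stored in this new node
                break
            node = node[seq[i]]
        depths.append(depth)
    return depths
```

Dividing the last entry of `insertion_depths(seq)` by `math.log(len(seq))` for a long random sequence gives one sample of the normalized depth $D_n/\ln n$ discussed above.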

Theorem 4.1 states full convergence in probability of these random variables to the constant $1/h$.

Theorem 4.1.

$$\frac{D_n}{\ln n} \xrightarrow[n\to\infty]{P} \frac{1}{h} \quad \text{and} \quad \frac{M_n}{\ln n} \xrightarrow[n\to\infty]{P} \frac{1}{h}.$$

Remark 4.2. For an i.i.d. sequence $U = U_1U_2\ldots$, in the case when the random variables $U_i$ are not
uniformly distributed in $\{A, C, G, T\}$, Theorem 4.1 implies that $D_n/\ln n$ does not converge a.s. because

$$\limsup_{n\to\infty} \frac{D_n}{\ln n} \ge \frac{1}{h} > \frac{1}{h_+} = \liminf_{n\to\infty} \frac{D_n}{\ln n}.$$

Proof of Theorem 4.1. It suffices to consider $D_n$ since, by definition of $M_n$,

$$P(M_n = r) = \frac{1}{n} \sum_{\nu=1}^{n} P(D_\nu = r).$$
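The identity above can be read on a single realised tree: a node, once inserted, never moves, so the node inserted at time $\nu$ stays at depth $D_\nu$, and a uniformly chosen node has depth $D_V$ with $V$ uniform on $\{1, \ldots, n\}$. The sketch below computes this conditional law from a list of depths; the function name is hypothetical and the input is assumed to be the list $[D_1, \ldots, D_n]$ of one tree.

```python
from collections import Counter
from fractions import Fraction

def uniform_node_depth_law(depths):
    """Law of the depth of a uniformly chosen node, given depths = [D_1, ..., D_n]."""
    n = len(depths)
    counts = Counter(depths)
    # Each of the n nodes is chosen with probability 1/n.
    return {r: Fraction(c, n) for r, c in sorted(counts.items())}
```

Averaging this empirical law over the randomness of the sequence gives the mixture formula for $P(M_n = r)$.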

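Let $\varepsilon > 0$. To prove Theorem 4.1, we show that $\lim_{n\to\infty} P(A_n) = 0$, where

$$A_n \stackrel{\text{def}}{=} \left\{ \left| \frac{D_n}{\ln n} - \frac{1}{h} \right| \ge \frac{\varepsilon}{h} \right\},$$

by using the obvious decomposition

$$P(A_n) = P\left( \frac{D_n}{\ln n} \ge \frac{1+\varepsilon}{h} \right) + P\left( \frac{D_n}{\ln n} \le \frac{1-\varepsilon}{h} \right).$$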
• Because of the definition (3) of $X_n$,

$$D_n = X_{n-1}(W(n)) + 1,$$

so that the duality between $X_n(s)$ and $T_k(s)$ implies that

$$P\left( \frac{D_n}{\ln n} \ge \frac{1+\varepsilon}{h} \right) \le P\big( X_{n-1}(W(n)) \ge k-1 \big) \le P\big( T_{k-1}(W(n)) \le n-1 \big) \qquad (26)$$

with $k \stackrel{\text{def}}{=} \left\lfloor \frac{1+\varepsilon}{h} \ln n \right\rfloor$. Furthermore,

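$$P\big( T_{k-1}(W(n)) \le n-1 \big) \le P\big( \{ T_{k-1}(W(n)) \le n-1 \} \cap B_{n,k_0} \big) + P\big( B_{n,k_0}^c \big),$$
where $B_{n,k_0}$ is defined, for any $k_0 \le n$, by

$$B_{n,k_0} \stackrel{\text{def}}{=} \bigcap_{k_0 \le j \le n} \left\{ \left| \frac{1}{j} \ln \frac{1}{P\big( W(n)^{(j)} \big)} - h \right| \le \varepsilon^2 h \right\}.$$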



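Since the sequence $U$ is stationary, $P\big( W(n)^{(j)} \big) = P\big( U^{(j)} \big)$, so that the Ergodic Theorem implies

$$\lim_{j\to\infty} \frac{1}{j} \ln \frac{1}{P\big( U^{(j)} \big)} = h \quad \text{a.s.},$$

which leads to $P(B_{n,k_0}) = 1$ when both $k_0$ and $n$ are large enough. If $\mathcal{S}_{n,k_0}$ denotes the set of words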



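$$\mathcal{S}_{n,k_0} \stackrel{\text{def}}{=} \left\{ s^{(n)} \in \mathcal{A}^n,\ \forall j \in \{k_0, \ldots, n\},\ \left| \frac{1}{j} \ln \frac{1}{P\big( s^{(j)} \big)} - h \right| \le \varepsilon^2 h \right\},$$

then, when $k_0$ and $n$ are large enough,

$$P\big( T_{k-1}(W(n)) \le n-1 \big) \le \sum_{s^{(n)} \in \mathcal{S}_{n,k_0}} P\big( W(n)^{(n)} = s^{(n)},\ T_{k-1}(s) \le n-1 \big) \le \sum_{s^{(n)} \in \mathcal{S}_{n,k_0}} P\big( T_{k-1}(s) \le n-1 \big).$$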

Such a probability has already been bounded above at the end of the proof of Theorem 3.1; similarly,

P(71-i(s) <n-l)=OUexp(- j^-n + — ln 1 

so that (26) and (27) show that $P\left( \frac{D_n}{\ln n} \ge \frac{1+\varepsilon}{h} \right)$ tends to zero as $n$ goes to infinity.

• Our argument showing that $P\left( \frac{D_n}{\ln n} \le \frac{1-\varepsilon}{h} \right)$ tends to zero when $n$ tends to infinity is similar. If now

dcf 



e 2 )h 



(27) 



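$$k \stackrel{\text{def}}{=} \left\lfloor \frac{1-\varepsilon}{h} \ln n \right\rfloor,$$

then

$$P\left( \frac{D_n}{\ln n} \le \frac{1-\varepsilon}{h} \right) \le P\big( X_{n-1}(W(n)) \le k-1 \big) = P\big( T_k(W(n)) \ge n \big),$$

so that

$$P\left( \frac{D_n}{\ln n} \le \frac{1-\varepsilon}{h} \right) \le P\big( \{ T_k(W(n)) \ge n \} \cap B_{n,k_0} \big) + P\big( B_{n,k_0}^c \big).$$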
As before, $P\big( B_{n,k_0}^c \big) = 0$ when $k_0$ and $n$ are large enough and

$$P\big( T_k(W(n)) \ge n \big) \le \sum_{s^{(n)} \in \mathcal{S}_{n,k_0}} P\big( W(n)^{(n)} = s^{(n)},\ T_k(s) \ge n \big) \le \sum_{s^{(n)} \in \mathcal{S}_{n,k_0}} P\big( T_k(s) \ge n \big).$$

As in the proof of Theorem 3.1, one shows that

F ( T k( s ) >n)=0 (Vexp (-ku 6 ' /2^j 

which implies that $P\left( \frac{D_n}{\ln n} \le \frac{1-\varepsilon}{h} \right)$ tends to zero when $n$ tends to infinity. The proof of Theorem 4.1 is
complete. □
Appendix A. Domain of definition of the generating function $\Phi(s^{(r)}, t)$
A.1. Proof of Assertion ii)

There exists a function $K(s_1, s_r, m)$ uniformly bounded by the constant

$$K \stackrel{\text{def}}{=} \sup |K(s_1, s_r, m)|$$

such that

$$Q^m(s_1, s_r) - p(s_r) = K(s_1, s_r, m)\, \gamma^m,$$
where $\gamma$ is the second eigenvalue of the transition matrix. Consequently,



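$$\left| \frac{1-t}{t\, p(s_r)} \sum_{m \ge 1} \big[ Q^m(s_1, s_r) - p(s_r) \big] t^m \right| \le \frac{\gamma K}{\min_u p(u)} \cdot \frac{|1-t|}{1 - \gamma |t|}.$$

Hence Assertion ii) holds with $\kappa' \stackrel{\text{def}}{=} \gamma K / \min_u p(u)$.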

A.2. Proof of Assertion i)

On the unit disc $|t| < 1$, the series

$$S(t) \stackrel{\text{def}}{=} \sum_{m \ge 1} Q^m(s_1, s_r)\, t^m$$

is convergent and one has the decomposition



J2 Q m (si,S r )t m = 1 + — — [Q m (si,Sr) -p(s r )]t m . 



The function 

m>l 

is analytically continuable to the domain $\gamma |t| < 1$, and then the series



-3-4^Q m ( Sl , Sr )f 



1 -t 

converges on the same domain. One has to determine the zeroes of 

dcf 



. lP ( a W) 



+ (1 - *) [1 + E 1 ^ Hsr... Sj = Sr . j+1 .., l} ] ■ 
Assuming that some $0 < t < 1$ were a real root of $D(t)$, then

• < ^SF^^ 

= (t-i)[i+ E ^- 1 ^i {s ,... s ,= Sr _ J+ ,.. sl} ] < o. 

It is thus obvious that $D(t)$ has no real root in $]0, 1[$. Moreover, one can readily check that
$0$ and $1$ are not zeroes of $D(t)$. We now look for a root of the form $t = 1 + \varepsilon$ with $\varepsilon > 0$. Such an $\varepsilon$
satisfies

(1 + sYp(s^) (l - p(sr) £ {1+£) E *« *r) " p(*r)]) 

£ = ^ 



1 + E(! + e) J - 1 -^l { . P .... J =^_ J+1 ....i} 

so that 

This implies that $\Phi(s^{(r)}, t)$ is at least defined on $[0, 1 + \kappa\, p(s^{(r)})[$. This implies the result.
Figure 3. Representation of 16 nucleotides of Mus Musculus GAGCACAGTGGAAGGG
in the CGR-tree (on the left) and in the "historyless representation" (on
the right).




Figure 4. Chaos Game Representation (on the left) and historyless representation 
(on the right) of the first 400000 nucleotides of Chromosome 2 of Homo Sapiens. 
[Diagram labels: $U_{3+T_5(s)}$, $U_{2+T_5(s)}$, $U_{1+T_5(s)}$; $s_1, \ldots, s_5$; $s_1, \ldots, s_6$.]

Figure 5. How overlapping intervenes in the definition of $Z_r(s)$. In this example, one takes
$r = 6$. In the random sequence, the prefix can occur starting from $U_{3+T_5(s)}$ only if
$s_1s_2s_3 = s_4s_5s_6$.
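The overlap condition in Figure 5 is an instance of the indicators $1_{\{s_r \ldots s_j = s_{r-j+1} \ldots s_1\}}$ used in the proofs. As a small illustration, the sketch below lists the indices $j \in \{2, \ldots, r\}$ for which the suffix $s_j \ldots s_r$ of a word coincides with its prefix $s_1 \ldots s_{r-j+1}$. The function name is hypothetical, and the index convention is an assumption read off the figure's example ($r = 6$, $j = 4$).

```python
def self_overlaps(word):
    """Indices j in {2, ..., r} such that s_j...s_r equals s_1...s_{r-j+1} (1-based)."""
    r = len(word)
    # word[j-1:] is the suffix starting at s_j; word[:r-j+1] is the prefix of the same length.
    return [j for j in range(2, r + 1) if word[j - 1:] == word[:r - j + 1]]
```

For the word ACGACG, the only overlap is at $j = 4$, which is the case $s_1s_2s_3 = s_4s_5s_6$ of the figure.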




Figure 6. Simulations of two random sequences. On the first graph, the letters of the
sequence are i.i.d. and uniformly distributed; on the second one, the i.i.d. letters have
probabilities $(p_A, p_C, p_G, p_T) = (0.4, 0.3, 0.2, 0.1)$. On the x-axis, the number $n$ of inserted
letters; on the y-axis, the normalized insertion depth $D_n/\ln n$ (oscillating curve) and the lengths
of the shortest and of the longest branch (regular lower and upper envelopes). The
horizontal lines correspond to the constant limits of these three random variables (on
the first graph, these three limits have the same value).
[Plot: Shortest and longest branches (lengths 100000, 2000 exps, probs = 0.6, 0.1, 0.1, 0.2).
Legend: experimental shortest branches; experimental longest branches; experimental $D_n$;
theoretical shortest branch = 5.000000; theoretical longest branch = 22.537878;
theoretical $D_n$ = 10.572987.]



Figure 7. Simulations of 2000 sequences of 100,000 i.i.d. letters. On the left, histogram
of shortest branches; in the middle, histogram of insertion depth of the last
inserted word; on the right, histogram of longest branches. Vertical lines are their
expected values, namely $\ln(10^5) \times l$ where $l$ respectively equals the limit of $\ell_n/\ln n$,
$D_n/\ln n$ and $\mathcal{L}_n/\ln n$.
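The three theoretical values in the legend of Figure 7 can be recomputed from the letter probabilities. The sketch below assumes that, for i.i.d. letters, $h_+ = -\ln(\min_a p_a)$, $h_- = -\ln(\max_a p_a)$ and $h$ is the Shannon entropy $-\sum_a p_a \ln p_a$; these assumptions reproduce the printed values 5.000000, 10.572987 and 22.537878 for the probabilities (0.6, 0.1, 0.1, 0.2) and $n = 10^5$. The function names are illustrative.

```python
import math

def iid_constants(probs):
    """Return (h_plus, h, h_minus) for i.i.d. letters with the given probabilities.

    Assumption: h_plus = -ln(min p), h_minus = -ln(max p), h = Shannon entropy.
    """
    h_plus = -math.log(min(probs))
    h_minus = -math.log(max(probs))
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h_plus, h, h_minus

def predicted_lengths(probs, n):
    """First-order predictions ln(n)/h_plus, ln(n)/h, ln(n)/h_minus for the
    shortest branch, the insertion depth and the longest branch."""
    h_plus, h, h_minus = iid_constants(probs)
    ln_n = math.log(n)
    return ln_n / h_plus, ln_n / h, ln_n / h_minus
```

Calling `predicted_lengths([0.6, 0.1, 0.1, 0.2], 10**5)` returns the three values in the order shortest branch, insertion depth, longest branch.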



